Abstract

Data-Oriented Parsing (DOP) ranks among the best parsing
schemes, pairing state-of-the-art parsing accuracy with
the psycholinguistic insight that larger chunks of syntactic
structure are relevant grammatical and probabilistic
units. Parsing with the DOP-model, however,
seems to involve a lot of CPU cycles and a consider- 
able amount of double work, brought on by the concept 
of multiple derivations, which is necessary for probabilis- 
tic processing, but which is not convincingly related to a 
proper linguistic backbone. It is however possible to re- 
interpret the DOP-model as a pattern-matching model, 
which tries to maximize the size of the substructures 
that construct the parse, rather than the probability of 
the parse. By emphasizing this memory-based aspect of 
the DOP-model, it is possible to do away with multiple 
derivations, opening up possibilities for efficient Viterbi- 
style optimizations, while still retaining acceptable pars- 
ing accuracy through enhanced context-sensitivity. 

1 Introduction 

The machine learning paradigm of Memory- 
Based Learning, based on the assumption that 
new problems are solved by direct reference to 
stored experiences of previously solved prob- 
lems, has been successfully applied to a number 
of linguistic phenomena, such as part-of-speech 
tagging, NP-chunking and stress acquisition 
(consult Daelemans (1999) for an overview).
To solve these particular problems, the linguistic
information needed to trigger the correct disambiguation
is encoded in a linear feature-value representation
and presented to a memory-based learner, such as TiMBL
(Daelemans et al., 1999).



Yet, many of the intricacies of the domain of 
syntax do not translate well to a linear repre- 
sentation, so that established MBL-methods are 
necessarily limited to low-level syntactic analy- 
sis, like the aforementioned NP-chunking task. 



Data Oriented Parsing (Bod, 1999), a state-of-the-art
natural language parsing system,
translates very well to a Memory Based Learn- 
ing context. This paper describes a re- 
interpretation of the DOP-model, in which the 
pattern-matching aspects of the model are ex- 
ploited, so that parses are analyzed by trying 
to match a new analysis to the largest possible 
substructures recorded in memory. 

A short introduction to Data Oriented Parsing
will be presented in Section 2, followed by an
explanation of the term pattern-matching in the
context of this paper (Section 3). Section 4 describes the
experimental setup and the corpus. The parsing
phase that precedes the disambiguation phase
will be outlined in Section 5, and a description
of the 3 disambiguating models, PCFG, PMPG
and the combined system PCFG+PMPG, can be
found in Sections 6, 7 and 8.

2 Data Oriented Parsing 

Data Oriented Parsing, originally conceived by
Remko Scha (Scha, 1990), has been successfully
applied to syntactic natural language parsing
by Rens Bod (1995, 1999). The aim of Data
Oriented Parsing (henceforth DOP) is to develop
a performance model of natural language, that 
models language use rather than some type of 
competence. It adapts the psycholinguistic in- 
sight that language users analyze sentences us- 
ing previously registered constructions and that 
not only rewrite rules, but complete substruc- 
tures of any given depth can be linguistically 
relevant units for parsing. 

2.1 Architecture 

The core of a DOP-system is its treebank: an
annotated corpus is used to induce all substructures
of arbitrary depth, together with their respective
probabilities. The probability of a substructure is
expressed by its frequency in the treebank relative to the
number of substructures with the same root node.

Figure 1: Multiple Derivations

Figure 1 shows the combination operation
that is needed to form the correct parse tree
for the sentence Peter killed a raccoon. Given a
treebank of substructures, the system tries to
match the leftmost open node of a substructure
that is consistent with the parse tree with
the top node of another substructure that is also
consistent with the parse tree.

Usually, different combinations of substructures
are possible, as is indicated in Figure 1:
in the example at the left-hand side the
tree-structure can be built by combining an
S-structure with a specified NP and a fully specified
VP-structure. The right example shows another
possible combination, where a parse tree is
built by combining the minimal substructures.
Note that these are consistent with ordinary
rewrite-rules, such as S -> NP VP.

One particular parse tree may thus consist of 
several different derivations. To find the prob- 
ability of a derivation, we multiply the proba- 
bilities of the substructures that were used to 
form the derivation. To find the probability of 
a parse, we must in principle sum the probabil- 
ities of all its derivations. 
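The three quantities described in this section can be written compactly as follows. The notation is introduced here for illustration only and is not taken from the original text: t is a substructure, |t| its frequency in the treebank, r(t) its root node, d a derivation and T a parse tree.

```latex
P(t) = \frac{|t|}{\sum_{t' : r(t') = r(t)} |t'|}
\qquad
P(d) = \prod_{t \in d} P(t)
\qquad
P(T) = \sum_{d \,\text{derives}\, T} P(d)
```

The sum in the last expression is what makes exact disambiguation expensive: it ranges over every derivation of the same tree.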

Considering all derivations for each parse is
computationally hardly tractable. Since
Viterbi optimization only succeeds in finding
the most probable derivation as opposed to the
most probable parse, the Monte Carlo algorithm
is introduced as a proper approximation
that randomly generates a large number of
derivations. The most probable parse is considered
to be the parse that is most often observed
in this derivation forest.
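The difference between the most probable derivation and the most probable parse can be illustrated with a small sketch. The derivation forest below is a hypothetical toy example, and enumerating all derivations up front is a simplification that would not be feasible for a real DOP chart; the function names are likewise illustrative.

```python
import random
from collections import Counter

# Hypothetical toy derivation forest: (parse label, derivation probability).
# Parse "A" has three derivations, parse "B" only one.
DERIVATIONS = [
    ("A", 0.2),
    ("A", 0.2),
    ("A", 0.2),
    ("B", 0.4),
]


def most_probable_derivation(derivations):
    """What a Viterbi-style search returns: the single best derivation."""
    return max(derivations, key=lambda d: d[1])


def exact_parse_probabilities(derivations):
    """Sum derivation probabilities per parse (only feasible for toy forests)."""
    totals = Counter()
    for parse, prob in derivations:
        totals[parse] += prob
    return dict(totals)


def monte_carlo_parse(derivations, n_samples, rng):
    """Sample derivations by probability; return the most frequently seen parse."""
    parses = [parse for parse, _ in derivations]
    weights = [prob for _, prob in derivations]
    sampled = rng.choices(parses, weights=weights, k=n_samples)
    return Counter(sampled).most_common(1)[0][0]


best_derivation = most_probable_derivation(DERIVATIONS)
best_parse = monte_carlo_parse(DERIVATIONS, n_samples=10_000, rng=random.Random(0))
```

In this toy forest the best single derivation belongs to parse B, while the summed probability mass, and therefore the sampled majority, favours parse A.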

2.2 Experimental Results of DOP

The basic DOP-model, DOP1, was tested on
a manually edited version of the ATIS-corpus
(Marcus, Santorini, and Marcinkiewicz, 1993).
The system was trained on 603 sentences
(part-of-speech tag sequences) and evaluated on a test
set of 75 sentences. Parse accuracy was used as
an evaluation metric, expressing the percentage
of sentences in the test set for which the parse
proposed by the system is completely identical
to the one in the original corpus. Different
experiments were conducted in which maximum
substructure size was varied. With DOP1
limited to a substructure-size of 1 (equivalent
to a PCFG), parse accuracy is 47%. In the optimal
DOP-model, in which substructure-size is
not limited, a parse accuracy of 85% is obtained.
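As a minimal sketch of the metric, parse accuracy is an exact-match percentage. The bracketed strings below are hypothetical placeholders; any representation that supports equality comparison would do.

```python
def parse_accuracy(proposed, gold):
    """Percentage of sentences whose proposed parse is identical to the gold parse."""
    if len(proposed) != len(gold):
        raise ValueError("proposed and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    exact_matches = sum(1 for p, g in zip(proposed, gold) if p == g)
    return 100.0 * exact_matches / len(gold)


# Hypothetical bracketed parses for a three-sentence test set.
gold = ["(S (NP x) (VP y))", "(S (NP a) (VP b))", "(S (NP p) (VP q))"]
proposed = ["(S (NP x) (VP y))", "(S (NP a) (VP b))", "(S (NP p q))"]
accuracy = parse_accuracy(proposed, gold)
```

A single misplaced bracket makes the whole sentence count as wrong, which is why this metric is stricter than bracket-level precision and recall.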

2.3 Short Assessment of DOP 

DOP1 in its optimal form achieves a very high
parse accuracy. The computational costs of the
system, however, are equally high. Bod (1995)
reported an average parse time of 3.5 hours per
sentence. Even though current parse time is
reported to be more reasonable, the optimal
DOP algorithm, in which no constraints are placed
on the size of substructures, may not yet be
tractable for life-size corpora.

In a context-free grammar framework (consistent
with DOP limited to a substructure-size
of 1), there is only one way a parse tree can
be formed (for example, the right-hand side of
Figure 1), meaning that there is only one derivation
for a given parse tree. This allows efficient
Viterbi-style optimization.

To encode context-sensitivity in the system, 
DOP is forced to introduce multiple derivations, 
so that repeatedly the same parse tree needs to 
be generated, bringing about a lot of computa- 
tional overhead. 

Even though the use of larger syntactic contexts
is highly relevant from a psycholinguistic
point of view, the DOP model does not express an
explicit preference for larger substructures.
While the Monte Carlo optimization
scheme maximizes the probability of the derivations
and seems to prefer derivations made up
of larger substructures, it may be interesting to
see if we can make this assumption explicit.

Table 1: Experimental Results

Figure 2: PCFG Error Analysis. (a) Correct Analysis (b) PCFG Analysis

3 Pattern-matching 

When we look at natural language parsing from 
a memory-based point of view, one might say 
that a sentence is analyzed by looking up the 
most similar structure for the different analy- 
ses of that sentence in memory. The parsing 
system described in this paper tries to mimic 
this behavior by interpreting the DOP-model as 
a memory-based model, in which analyses are 
being matched with syntactic patterns recorded 
in memory. Similarity between the proposed 
analysis and the patterns in memory is com- 
puted according to: 

• the number of patterns needed to construct 
a tree (to be minimized) 

• the size of the patterns that are used to 
construct a tree (to be maximized) 

The nearest neighbor for a given analysis can
be defined as the derivation that shares the
largest number of nodes with it.

4 The Experimental Setup

10-fold cross-validation was used to evaluate
the algorithms, as the dataset
(see Section 4.1) is rather small. Like DOP1, the
system is trained and tested on part-of-speech
tag sequences. In a first phase, a simple bottom-up
chart parser, trained on the training partitions,
was used to generate parse forests for the
part-of-speech tag sequences of the test partition.
Next, the parse forests were sent to the 3
algorithms (henceforth the disambiguators) to
order these parse forests, the first parse of the
ordered parse forest being the one proposed by
the disambiguator.
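The partitioning behind 10-fold cross-validation can be sketched as below. How the folds were actually drawn is not stated in the text, so the interleaved assignment without shuffling and the name `ten_fold_partitions` are assumptions made for illustration.

```python
def ten_fold_partitions(sentences, n_folds=10):
    """Yield (training, test) pairs so that every sentence is tested exactly once."""
    # Assumption: interleaved assignment of sentences to folds, no shuffling.
    folds = [sentences[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        yield train, test


# Hypothetical stand-in for a corpus of 562 part-of-speech tag sequences.
corpus = [f"sentence-{i}" for i in range(562)]
splits = list(ten_fold_partitions(corpus))
```

Each training partition would be used to induce the grammar for the chart parser, and the matching test partition would be parsed and disambiguated.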

In this paper, 3 disambiguators are described: 

• PCFG: simple Probabilistic Context-Free 
Grammar 

• PMPG: the DOP approximation, Pattern- 
Matching Probabilistic Grammar 

• PCFG+PMPG: a combined system, integrating
PCFG and PMPG

The evaluation metric used is parse accuracy, 
but also the typical parser evaluation metric F- 
measure (precision/recall) is given as a means 
of reference to other systems. 

4.1 The Corpus 

The experiments were conducted on an edited
version of the ATIS-II-corpus (Marcus, Santorini,
and Marcinkiewicz, 1993), which consists
of 578 sentences. Quite a lot of errors and
inconsistencies were found, but not corrected,
since we want our (probabilistic) system to be
able to deal with this kind of noise. Semantically
oriented flags like -TMP and -DIR, most
often used in conjunction with PP, have been
removed, since there is no way of retrieving this
kind of semantic information from the part-of-speech
tags of the ATIS-corpus. Syntactic flags
like -SBJ, on the other hand, have been maintained.
Internal relations (denoted by numeric
flags) were removed and, for practical reasons,
sentence length was limited to a maximum of 15 words.
The edited corpus retained 562 sentences.

5 Parsing 

As a first phase, a bottom-up chart parser 
parsed the test set. This proved to be quite 
problematic, since overall, 106 out of 562 sen- 
tences (19%) could not be parsed, due to the 
sparseness of the grammar, meaning that the 
appropriate rewrite rule needed to construct the 
correct parse tree for a sentence in the test set, 
wasn't featured in the induced grammar. NP-annotation
seemed to be the main cause for unparsability.
An NP like restriction code AP/57
is represented by the rewrite rule:
NP -> NN NN SYM SYM SYM CD CD

Highly specific and flat structures like these 
are scarce and are usually not induced from the 
training set when needed to parse the test set. 

Ongoing research tries to implement grammatical
smoothing as a solution to this problem,
but one might also consider generating parse
forests with an independent grammar, induced
from the entire corpus (training set + test set) or
from a different corpus. In both cases, however, we
would need to apply probabilistic smoothing to
be able to assign probabilities to unknown
structures/rules. Neither grammatical nor probabilistic
smoothing was implemented in the context
of the experiments described in this paper.

The sparseness of the grammar proves to be 
a serious bottleneck for parse accuracy, limiting 
our disambiguators to a maximum parse accu- 
racy of 81%. 

6 PCFG-experiments 

A PCFG constructs parse trees by using simple
rewrite-rules. The probability of a parse tree
can be computed by multiplying the probabilities
of the rewrite-rules that were used to construct
the parse. Note that a PCFG is identical
to DOP1 when we limit the maximum substructure
size to 1, only allowing derivations of the
type found at the right-hand side of Figure 1.
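A minimal sketch of this computation follows, assuming trees are stored as nested tuples whose leaves are part-of-speech tags and that rule probabilities are plain relative frequencies. The two-tree treebank and all function names are hypothetical. Unseen rules receive probability 0.0, in line with the absence of smoothing noted in Section 5.

```python
from collections import Counter
from math import prod


def rules(tree):
    """Yield (lhs, rhs) rewrite rules of a nested-tuple tree; leaves are tag strings."""
    if isinstance(tree, str):
        return
    lhs, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield lhs, rhs
    for child in children:
        yield from rules(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}


def tree_probability(tree, grammar):
    """Product of the probabilities of all rewrite rules used in the tree."""
    return prod(grammar.get(rule, 0.0) for rule in rules(tree))


# Hypothetical two-tree treebank over part-of-speech tags.
treebank = [
    ("S", ("NP", "NNP"), ("VP", "VBD", ("NP", "DT", "NN"))),
    ("S", ("NP", "DT", "NN"), ("VP", "VBD")),
]
grammar = estimate_pcfg(treebank)
probability = tree_probability(treebank[0], grammar)
```

Every additional rule contributes one more factor of at most 1 to the product, so trees with more internal nodes can only stay equal or become less probable.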

6.1 Experimental Results 

The first line of Table 1 shows the results for the
PCFG-experiments: 66.4% parse accuracy is an
adequate result for this baseline model. We also
look at parse accuracy for parsable sentences
(an estimate of the parse accuracy we might
get if we had a better suited parse forest generator)
and we notice that we are able to achieve
an 81.8% parse accuracy. This is already quite
high, but on examining the parsed data, serious
and fundamental limitations to the PCFG-model
can be observed.

6.2 Error Analysis 

Figure 2 displays the most common type of mistake
made by PCFGs. The correct parse tree
could represent an analysis for the sentence:

I want a flight from Brussels to Toronto.

This example shows that a PCFG has a tendency
to prefer flatter structures over embedded
structures. This is a trivial effect of the mathematical
formula used to compute the probability
of a parse tree: embedded structures require
more rewrite rules, adding more factors to the
multiplication, which will almost inevitably result
in a lower probability.

It is an unfortunate property of PCFGs that
the probability of a parse tree tends to decrease
as the number of nodes in the tree grows. One might be
inclined to normalize a parse tree's probability
relative to the number of nodes in the tree, but a
more linguistically sound alternative is at hand:
the enhancement of context sensitivity through
the use of larger syntactic context within parse
trees can make our disambiguator more robust.

7 PMPG-experiments 

The Pattern-Matching Probabilistic Grammar
is a memory-based interpretation of a DOP-model,
in which a sentence is analyzed by
matching the largest possible chunks of syntactic
structure on the sentence. To compile
parse trees into patterns, all substructures in
the training set are encoded by assigning them
specific indexes, with NP@345, for example, denoting a fully
specified NP-structure. This approach was inspired
by Goodman (1996), in which Goodman
unsuccessfully uses a system of indexed parse
trees to transform DOP into an equivalent PCFG.
The system of indexing used in the experiments described
in this paper (which is detailed in De Pauw (2000))
is, however, specifically geared towards
encoding contextual information in parse
trees.

Given an indexed training set, indexes can 
then be matched on a test set parse tree in a 
bottom-up fashion. In the following example, 
boxed nodes indicate nodes that have been re- 
trieved from memory. 

[Figure: parse tree rooted in S; boxed nodes mark the structure retrieved from memory]

In this example we can see that an NP, consisting
of a fully specified embedded NP and
PP, has been completely retrieved from memory,
meaning that the NP in its entirety can
be observed in the training set. However, no
VP was found that consists of a VBP and that
particular NP. Disambiguating with PMPG consequently
involves pruning all nodes retrieved
from memory:
            S
          /   \
    NP-SBJ     VP
              /  \
           VBP    NP

Finally, the probability for this pruned parse 
tree is computed in a PCFG-type manner, not 
adding the retrieved nodes to the product: 

$$P(\text{parse}) = P(\text{S} \rightarrow \text{NP-SBJ VP}) \cdot P(\text{VP} \rightarrow \text{VBP NP})$$
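The pruning step can be sketched as follows. The actual indexing scheme is described in De Pauw (2000) and is not reproduced here; this sketch assumes, as a simplification, that a node counts as retrieved from memory when the complete subtree below it occurs somewhere in the training set. The trees, rule probabilities and function names are hypothetical.

```python
from math import prod


def subtrees(tree):
    """Yield every complete subtree (nested tuples); leaves are tag strings."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)


def build_memory(treebank):
    """Assumed stand-in for indexing: remember every complete subtree seen in training."""
    return {sub for tree in treebank for sub in subtrees(tree)}


def unretrieved_rules(tree, memory):
    """Yield the rewrite rules of nodes that could not be retrieved from memory."""
    if isinstance(tree, str) or tree in memory:
        return  # tags carry no rule; retrieved nodes are pruned with their subtree
    lhs, *children = tree
    yield lhs, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from unretrieved_rules(child, memory)


def pmpg_probability(tree, memory, rule_probs):
    """PCFG-style product over the rules that survive pruning."""
    return prod(rule_probs.get(rule, 0.0) for rule in unretrieved_rules(tree, memory))


# Hypothetical training tree and test tree: they differ only in the verb tag.
big_np = ("NP", ("NP", "DT", "NN"), ("PP", "IN", ("NP", "NNP")))
training = [("S", ("NP-SBJ", "PRP"), ("VP", "VBD", big_np))]
test_tree = ("S", ("NP-SBJ", "PRP"), ("VP", "VBP", big_np))

memory = build_memory(training)
# Hypothetical rule probabilities for the two rules that remain after pruning.
rule_probs = {("S", ("NP-SBJ", "VP")): 0.9, ("VP", ("VBP", "NP")): 0.3}
surviving = list(unretrieved_rules(test_tree, memory))
probability = pmpg_probability(test_tree, memory, rule_probs)
```

Here the subject NP and the object NP are both found in memory, so only the S and VP rules contribute to the product, mirroring the pruned tree shown in the text.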

7.1 Experimental Results 

The results for the PMPG-experiments can be
found on the second line of Table 1. On some
partitions, PMPG performed slightly, though not
significantly, better than PCFG, but Table 1 shows that the
results for the context-sensitive scheme are much
worse. 58.2% overall parse accuracy and 71.7%
parse accuracy on parsable sentences indicate
that PMPG is not a valid approximation of DOP's
context-sensitivity.



7.2 Error Analysis 

The dramatic drop in parsing accuracy calls for
an error analysis of the parsed data. Figure 3
shows a prototypical mistake made by PMPG. The
correct analysis could represent a parse tree for
a sentence like:

What flights can I get from Brussels to Toronto.

The PMPG analysis would never have been
considered a likely candidate by a common
PCFG. This particular sentence was in fact
effortlessly disambiguated by the PCFG. Yet
the fact that large chunks of tree-structure are
retrieved from memory makes it the preferred
parse for the PMPG. We notice for instance that
a large part of the sentence can be matched
on an SBAR structure, which has no relevance
whatsoever.

Clearly, PMPG overestimates substructure
size as a feature for disambiguation. It is interesting,
however, to see that it is a working implementation
of context sensitivity, eagerly matching
patterns from memory. At the same time, it
has lost track of common-sense PCFG tactics. It
is in the combination of the two that one may
find a decent disambiguator and an accurate
implementation of context-sensitivity.

8 A Combined System (PMPG+PCFG)

Table 1 showed that 81.8% of the time, a PCFG
finds the correct parse (for parsable sentences),
meaning that the correct parse is in first
place in the ordered parse forest. 99% of the
time, the correct parse can be found among the
10 most probable parses in the ordered parse
forest. This opens up a myriad of possibilities
for optimization. One might for instance
use a best-first strategy to generate only the 10
best parses, significantly reducing parse and
disambiguation time. An optimized disambiguator
might therefore include a preparatory phase in
which a common-sense PCFG retains the most
probable parses, so that a more sophisticated
follow-up scheme need not bother with senseless
analyses.

In our experiments, we combined the two
disambiguators by using the output of the
common-sense PCFG as the PMPG's input. This is a
well-established technique usually referred to as
system combination (see van Halteren, Zavrel, and
Daelemans (1998) for an application of this
technique to part-of-speech tagging):

sentences -> PCFG -> n most probable parses -> PMPG -> most probable parse

We are also presented with the possibility to
assign a weight to each algorithm's decision.
The probability of a parse can then be described
with the following formula:

$$P(\text{parse}) = \frac{\prod_i P(\text{rewrite-rule})_i}{(\#\,\text{non-indexed nodes})^{n}}$$

The weight of each algorithm's decision, as 
well as the number of most probable parses that 
are extrapolated for the pattern-matching algo- 
rithm, are parameters to be optimized. Future 
work will include evaluation on a validation set 
to retrieve the optimal values for these param- 
eters. 
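A minimal sketch of the two-stage combination is given below, under stated assumptions: the exponent in the formula is treated as the tunable weight, a parse whose nodes are all retrieved is treated as having one non-indexed node to avoid division by zero, and the candidate parses with their scores are hypothetical. None of these details are confirmed by the text.

```python
def combined_score(pcfg_probability, non_indexed_nodes, weight):
    """PCFG probability divided by (# non-indexed nodes) ** weight."""
    # Assumption: guard against a fully retrieved parse (zero non-indexed nodes).
    return pcfg_probability / (max(non_indexed_nodes, 1) ** weight)


def rerank(candidates, n_best, weight):
    """candidates: list of (parse, pcfg_probability, non_indexed_nodes) tuples."""
    # Stage 1: the PCFG keeps only its n most probable parses.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_best]
    # Stage 2: pattern matching reorders the shortlist.
    return max(shortlist, key=lambda c: combined_score(c[1], c[2], weight))[0]


# Hypothetical parse forest for one sentence.
candidates = [
    ("flat", 0.020, 5),
    ("embedded", 0.012, 2),
    ("senseless", 0.007, 1),
]
choice = rerank(candidates, n_best=2, weight=1.0)
pcfg_only = rerank(candidates, n_best=2, weight=0.0)
no_shortlist = rerank(candidates, n_best=3, weight=1.0)
```

With these made-up numbers, the shortlist keeps the improbable but heavily pattern-matched analysis away from the second stage, while the pattern-matching term lets the embedded analysis overtake the flat one.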

8.1 Results 

The third line in Table 1 shows that the combined
system performs better than either one,
with a parse accuracy of 71.5% and close to 90%
parse accuracy on parsable sentences, which we
can consider an approximation of results reported
for DOP1. Error analysis shows that
the combined system is indeed able to overcome
difficulties of both algorithms. The example
in Figure 2 as well as the example in Figure
3 were disambiguated correctly using the combined
system.

9 Future Research 

Even though the PMPG shows a lot of promise
in its parse accuracy, the following extensions
need to be researched:

• Optimizing PMPG+PCFG for computational
efficiency: the graph in Section 8
shows a possible optimized parsing system,
in which a pre-processing PCFG generates
the n most likely candidates to be passed
on to the actual disambiguator. Full
parse forests were generated for the experiments
described in this paper, so that the
efficiency gain of such a system cannot be
properly estimated.

• PMPG+PCFG as an approximation needs to
be compared to actual DOP, by having DOP
parse the data used in this experiment, and
by having PMPG+PCFG parse the data used
in the experiments described in Bod (1995).



• The bottleneck of the sparse grammar 
problem prevents us from fully exploiting 
the disambiguating power of the pattern- 
matching algorithm. The GRAEL-system 
(GRammar Adaptation, Evolution and 
Learning) that is currently being devel- 
oped, tries to address the problem of gram- 
matical sparseness by using evolutionary 
techniques to generate, optimize and com- 
plement grammars. 

10 Conclusions 

Even though DOP1 exhibits outstanding parsing
behavior, the efficiency of the model is
rather problematic. The introduction of mul- 
tiple derivations causes a considerable amount 
of computational overhead. Neither is it clear 
how the concept of multiple derivations trans- 
lates to a psycholinguistic context: there is no 
proof that language users consider different in- 
stantiations of the same parse, when deciding 
on the correct analysis for a given sentence. 

A pattern-matching scheme was presented 
that tried to disambiguate parse forests by 
trying to maximize the size of the substruc- 
tures that can be retrieved from memory. 
This straightforward memory-based interpreta- 
tion yields sub-standard parsing accuracy. But 
the combination of common-sense probabili- 
ties and enhanced context-sensitivity provides 
a workable parse forest disambiguator, indicat- 
ing that language users might exert a complex 
combination of memory-based recollection tech- 
niques and stored statistical data to analyze ut- 
terances. 